{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "aklJxkHBD5aR"
   },
   "source": [
    "# Predicting babyweight using BigQuery ML\n",
    "\n",
    "This notebook illustrates:\n",
    "<ol>\n",
    "<li> Machine Learning using BigQuery\n",
    "<li> Make a Prediction with BQML using the Model\n",
    "</ol>\n",
    "\n",
    "Please see [this notebook](1_explore.ipynb) for more context on this problem and how the features were chosen."
   ]
  },
  {
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
 "!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst"
   ]
  },
  {

"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
 "# Ensure the right version of Tensorflow is installed.\n",
 "!pip freeze | grep tensorflow==2.1"
 ]
},
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "colab": {
     "autoexec": {
      "startup": false,
      "wait_interval": 0
     }
    },
    "colab_type": "code",
    "id": "BhUiclqCD5aT"
   },
   "outputs": [],
   "source": [
    "# change these to try this notebook out\n",
    "PROJECT = 'cloud-training-demos'\n",
    "REGION = 'us-central1'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {
     "autoexec": {
      "startup": false,
      "wait_interval": 0
     }
    },
    "colab_type": "code",
    "id": "F10_KsX7D5aX"
   },
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ['PROJECT'] = PROJECT\n",
    "os.environ['REGION'] = REGION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {
     "autoexec": {
      "startup": false,
      "wait_interval": 0
     }
    },
    "colab_type": "code",
    "id": "YrqwSv6vD5aZ",
    "outputId": "402ed11b-ab4a-4480-a1aa-f72741cd2cc1"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Updated property [core/project].\n",
      "Updated property [compute/region].\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "gcloud config set project $PROJECT\n",
    "gcloud config set compute/region $REGION"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "j782YaAzD5ae"
   },
   "source": [
    "## Exploring the Data\n",
    "\n",
    "Here, we will be taking natality data and training on features to predict the birth weight.\n",
    "\n",
    "The CDC's Natality data has details on US births from 1969 to 2008 and is available in BigQuery as a public data set. More details: https://bigquery.cloud.google.com/table/publicdata:samples.natality?tab=details\n",
    "\n",
    "Lets start by looking at the data since 2000 with useful values > 0!"
   ]
  },
  {

"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
   "source": [
    "%%bigquery\n",
    "SELECT\n",
    "    *\n",
    "FROM\n",
    "  publicdata.samples.natality\n",
    "WHERE\n",
    "  year > 2000\n",
    "  AND gestation_weeks > 0\n",
    "  AND mother_age > 0\n",
    "  AND plurality > 0\n",
    "  AND weight_pounds > 0\n",
    "LIMIT 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ep02J20pD5ai"
   },
   "source": [
    "## Define Features\n",
    "\n",
    "Looking over the data set, there are a few columns of interest that could be leveraged into features for a reasonable prediction of approximate birth weight.\n",
    "\n",
    "Further, some feature engineering may be accomplished with the BigQuery `CAST` function -- in BQML, all strings are considered categorical features and all numeric types are considered continuous ones.\n",
    "\n",
    "The hashmonth is added so that we can repeatably split the data without leakage -- we want all babies that share a birthday to be either in training set or in test set and not spread between them (otherwise, there would be information leakage when it comes to triplets, etc.)"
   ]
  },
  {

"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
   "source": [
    "%%bigquery\n",
    "SELECT\n",
    "    weight_pounds, -- this is the label; because it is continuous, we need to use regression\n",
    "    CAST(is_male AS STRING) AS is_male,\n",
    "    mother_age,\n",
    "    CAST(plurality AS STRING) AS plurality,\n",
    "    gestation_weeks,\n",
    "    FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth\n",
    "FROM\n",
    "  publicdata.samples.natality\n",
    "WHERE\n",
    "  year > 2000\n",
    "  AND gestation_weeks > 0\n",
    "  AND mother_age > 0\n",
    "  AND plurality > 0\n",
    "  AND weight_pounds > 0\n",
    "LIMIT 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "H-d7C8KcD5am"
   },
   "source": [
    "## Train Model\n",
    "\n",
    "With the relevant columns chosen to accomplish predictions, it is then possible to create (train) the model in BigQuery. First, a dataset will be needed store the model. (if this throws an error in Datalab, simply create the dataset from the BigQuery console)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "autoexec": {
      "startup": false,
      "wait_interval": 0
     }
    },
    "colab_type": "code",
    "id": "bWMZJQdKD5an"
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "bq --location=US mk -d demo"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "T9JZIQ9nD5ap"
   },
   "source": [
    "With the demo dataset ready, it is possible to create a linear regression model to train the model.\n",
    "\n",
    "This will take approximately **4 minutes** to run and will show **Done** when complete."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "autoexec": {
      "startup": false,
      "wait_interval": 0
     }
    },
    "colab_type": "code",
    "id": "V-UJOX0tD5aq",
    "outputId": "63aa219c-a14f-4db5-a271-2e614759fe1a"
   },
   "outputs": [],
   "source": [
    "%%bigquery\n",
    "CREATE or REPLACE MODEL demo.babyweight_model_asis\n",
    "OPTIONS\n",
    "  (model_type='linear_reg', labels=['weight_pounds']) AS\n",
    "  \n",
    "WITH natality_data AS (\n",
    "  SELECT\n",
    "    weight_pounds,-- this is the label; because it is continuous, we need to use regression\n",
    "    CAST(is_male AS STRING) AS is_male,\n",
    "    mother_age,\n",
    "    CAST(plurality AS STRING) AS plurality,\n",
    "    gestation_weeks,\n",
    "    FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth\n",
    "  FROM\n",
    "    publicdata.samples.natality\n",
    "  WHERE\n",
    "    year > 2000\n",
    "    AND gestation_weeks > 0\n",
    "    AND mother_age > 0\n",
    "    AND plurality > 0\n",
    "    AND weight_pounds > 0\n",
    ")\n",
    "\n",
    "SELECT\n",
    "    weight_pounds,\n",
    "    is_male,\n",
    "    mother_age,\n",
    "    plurality,\n",
    "    gestation_weeks\n",
    "FROM\n",
    "    natality_data\n",
    "WHERE\n",
    "  ABS(MOD(hashmonth, 4)) < 3  -- select 75% of the data as training"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "mJYtGxWyD5at"
   },
   "source": [
    "## Training Statistics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "0L92s-gtD5au"
   },
   "source": [
    "During the model training (and after the training), it is possible to see the model's training evaluation statistics.\n",
    "\n",
    "For each training run, a table named `<model_name>_eval` is created. This table has basic performance statistics for each iteration.\n",
    "\n",
    "While the new model is training, review the training statistics in the BigQuery UI to see the below model training: https://bigquery.cloud.google.com/\n",
    "\n",
    "Since these statistics are updated after each iteration of model training, you will see different values for each refresh while the model is training.\n",
    "\n",
    "The training details may also be viewed after the training completes from this notebook."
   ]
  },
  {

"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
 "source": [
    "%%bigquery\n",
    "SELECT * FROM ML.TRAINING_INFO(MODEL demo.babyweight_model_asis);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "CATWu60vD5ay"
   },
   "source": [
    "Some of these columns are obvious although what do the non-specific ML columns mean (specific to BQML)?\n",
    "\n",
    "**training_run** - Will be zero for a newly created model. If the model is re-trained using warm_start, this will increment for each re-training.\n",
    "\n",
    "**iteration** - Number of the associated `training_run`, starting with zero for the first iteration.\n",
    "\n",
    "**duration_ms** - Indicates how long the iteration took (in ms).\n",
    "\n",
    "Note: You can also see these stats by refreshing the BigQuery UI window, finding the `<model_name>` table, selecting on it, and then the Training Stats sub-header.\n",
    "\n",
    "Observe the training and evaluation loss and see if the model has an overfit."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "9QPwMA_ZD5a3"
   },
   "source": [
    "## Make a Prediction with BQML using the Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "9QBZxMb6D5a5"
   },
   "source": [
    "With a trained model, it is now possible to make a prediction on the values. The only difference from the second query above is the reference to the model. The data has been limited (`LIMIT 100`) to reduce amount of data returned.\n",
    "\n",
    "When the `ml.predict` function is leveraged, output prediction column name for the model is `predicted_<label_column_name>`."
   ]
  },
  {

"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
    "%%bigquery\n",
    "SELECT\n",
    "  *\n",
    "FROM\n",
    "  ml.PREDICT(MODEL demo.babyweight_model_asis,\n",
    "      (SELECT\n",
    "        weight_pounds,\n",
    "        CAST(is_male AS STRING) AS is_male,\n",
    "        mother_age,\n",
    "        CAST(plurality AS STRING) AS plurality,\n",
    "        gestation_weeks\n",
    "      FROM\n",
    "        publicdata.samples.natality\n",
    "      WHERE\n",
    "        year > 2000\n",
    "        AND gestation_weeks > 0\n",
    "        AND mother_age > 0\n",
    "        AND plurality > 0\n",
    "        AND weight_pounds > 0\n",
    "    ))\n",
    "LIMIT 100"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "0DZorSI-D5a9"
   },
   "source": [
    "# More advanced...\n",
    "\n",
    "In the original example, we were taking into account the idea that if no ultrasound has been performed, some of the features (e.g. is_male) will not be known. Therefore, we augmented the dataset with such masked features and trained a single model to deal with both these scenarios.\n",
    "\n",
    "In addition, during data exploration, we learned that the data size set for mothers older than 45 was quite sparse, so we will discretize the mother age."
   ]
  },
  {

"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
    "%%bigquery\n",
    "SELECT\n",
    "    weight_pounds,\n",
    "    CAST(is_male AS STRING) AS is_male,\n",
    "    IF(mother_age < 18, 'LOW',\n",
    "         IF(mother_age > 45, 'HIGH',\n",
    "            CAST(mother_age AS STRING))) AS mother_age,\n",
    "    CAST(plurality AS STRING) AS plurality,\n",
    "    CAST(gestation_weeks AS STRING) AS gestation_weeks,\n",
    "    FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth\n",
    "  FROM\n",
    "    publicdata.samples.natality\n",
    "  WHERE\n",
    "    year > 2000\n",
    "    AND gestation_weeks > 0\n",
    "    AND mother_age > 0\n",
    "    AND plurality > 0\n",
    "    AND weight_pounds > 0\n",
    "LIMIT 25"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ZwetSzIeD5bC"
   },
   "source": [
    "On the same dataset, will also suppose that it is unknown whether the child is male or female (on the same dataset) to simulate that an ultrasound was not been performed."
   ]
  },
  {

"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
    "%%bigquery\n",
    "SELECT\n",
    "    weight_pounds,\n",
    "    'Unknown' AS is_male,\n",
    "    IF(mother_age < 18, 'LOW',\n",
    "         IF(mother_age > 45, 'HIGH',\n",
    "            CAST(mother_age AS STRING))) AS mother_age,\n",
    "    IF(plurality > 1, 'Multiple', 'Single') AS plurality,\n",
    "    CAST(gestation_weeks AS STRING) AS gestation_weeks,\n",
    "    FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth\n",
    "  FROM\n",
    "    publicdata.samples.natality\n",
    "  WHERE\n",
    "    year > 2000\n",
    "    AND gestation_weeks > 0\n",
    "    AND mother_age > 0\n",
    "    AND plurality > 0\n",
    "    AND weight_pounds > 0\n",
    "LIMIT 25"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "MmY2chPyD5bI"
   },
   "source": [
    "Bringing these two separate data sets together, there is now a dataset for male or female children determined with ultrasound or unknown if without. "
   ]
  },
  {

"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
    "%%bigquery\n",
    "WITH with_ultrasound AS (\n",
    "  SELECT\n",
    "    weight_pounds,\n",
    "    CAST(is_male AS STRING) AS is_male,\n",
    "    IF(mother_age < 18, 'LOW',\n",
    "         IF(mother_age > 45, 'HIGH',\n",
    "            CAST(mother_age AS STRING))) AS mother_age,\n",
    "    CAST(plurality AS STRING) AS plurality,\n",
    "    CAST(gestation_weeks AS STRING) AS gestation_weeks,\n",
    "    FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth\n",
    "  FROM\n",
    "    publicdata.samples.natality\n",
    "  WHERE\n",
    "    year > 2000\n",
    "    AND gestation_weeks > 0\n",
    "    AND mother_age > 0\n",
    "    AND plurality > 0\n",
    "    AND weight_pounds > 0\n",
    "),\n",
    "\n",
    "without_ultrasound AS (\n",
    "  SELECT\n",
    "    weight_pounds,\n",
    "    'Unknown' AS is_male,\n",
    "    IF(mother_age < 18, 'LOW',\n",
    "         IF(mother_age > 45, 'HIGH',\n",
    "            CAST(mother_age AS STRING))) AS mother_age,\n",
    "    IF(plurality > 1, 'Multiple', 'Single') AS plurality,\n",
    "    CAST(gestation_weeks AS STRING) AS gestation_weeks,\n",
    "    FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth\n",
    "  FROM\n",
    "    publicdata.samples.natality\n",
    "  WHERE\n",
    "    year > 2000\n",
    "    AND gestation_weeks > 0\n",
    "    AND mother_age > 0\n",
    "    AND plurality > 0\n",
    "    AND weight_pounds > 0\n",
    "),\n",
    "\n",
    "preprocessed AS (\n",
    "  SELECT * from with_ultrasound\n",
    "  UNION ALL\n",
    "  SELECT * from without_ultrasound\n",
    ")\n",
    "\n",
    "SELECT\n",
    "    weight_pounds,\n",
    "    is_male,\n",
    "    mother_age,\n",
    "    plurality,\n",
    "    gestation_weeks\n",
    "FROM\n",
    "    preprocessed\n",
    "WHERE\n",
    "  ABS(MOD(hashmonth, 4)) < 3\n",
    "LIMIT 25"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "hNnZxDw9D5bN"
   },
   "source": [
    "## Create a new model\n",
    "\n",
    "With a data set which has been feature engineered, it is ready to create model with the `CREATE or REPLACE MODEL` statement\n",
    "\n",
    "This will take **5-10 minutes** and will show **Done** when complete."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "colab": {
     "autoexec": {
      "startup": false,
      "wait_interval": 0
     }
    },
    "colab_type": "code",
    "id": "aKPqXqicD5bP"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "Done"
      ],
      "text/plain": [
       "QueryResultsTable job_rSeIYeqqsx3sv-reWIaqUcF7HQHb"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "CREATE or REPLACE MODEL demo.babyweight_model_fc\n",
    "OPTIONS\n",
    "  (model_type='linear_reg', labels=['weight_pounds']) AS\n",
    "  \n",
    "WITH with_ultrasound AS (\n",
    "  SELECT\n",
    "    weight_pounds,\n",
    "    CAST(is_male AS STRING) AS is_male,\n",
    "    IF(mother_age < 18, 'LOW',\n",
    "         IF(mother_age > 45, 'HIGH',\n",
    "            CAST(mother_age AS STRING))) AS mother_age,\n",
    "    CAST(plurality AS STRING) AS plurality,\n",
    "    CAST(gestation_weeks AS STRING) AS gestation_weeks,\n",
    "    FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth\n",
    "  FROM\n",
    "    publicdata.samples.natality\n",
    "  WHERE\n",
    "    year > 2000\n",
    "    AND gestation_weeks > 0\n",
    "    AND mother_age > 0\n",
    "    AND plurality > 0\n",
    "    AND weight_pounds > 0\n",
    "),\n",
    "\n",
    "without_ultrasound AS (\n",
    "  SELECT\n",
    "    weight_pounds,\n",
    "    'Unknown' AS is_male,\n",
    "    IF(mother_age < 18, 'LOW',\n",
    "         IF(mother_age > 45, 'HIGH',\n",
    "            CAST(mother_age AS STRING))) AS mother_age,\n",
    "    IF(plurality > 1, 'Multiple', 'Single') AS plurality,\n",
    "    CAST(gestation_weeks AS STRING) AS gestation_weeks,\n",
    "    FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth\n",
    "  FROM\n",
    "    publicdata.samples.natality\n",
    "  WHERE\n",
    "    year > 2000\n",
    "    AND gestation_weeks > 0\n",
    "    AND mother_age > 0\n",
    "    AND plurality > 0\n",
    "    AND weight_pounds > 0\n",
    "),\n",
    "\n",
    "preprocessed AS (\n",
    "  SELECT * from with_ultrasound\n",
    "  UNION ALL\n",
    "  SELECT * from without_ultrasound\n",
    ")\n",
    "\n",
    "SELECT\n",
    "    weight_pounds,\n",
    "    is_male,\n",
    "    mother_age,\n",
    "    plurality,\n",
    "    gestation_weeks\n",
    "FROM\n",
    "    preprocessed\n",
    "WHERE\n",
    "  ABS(MOD(hashmonth, 4)) < 3"
   ]
  },

  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "mJYtGxWyD5at"
   },
   "source": [
    "## Training Statistics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "0L92s-gtD5au"
   },
   "source": [
    "While the new model is training, review the training statistics in the BigQuery UI to see the below model training: https://bigquery.cloud.google.com/\n",
    "\n",
    "The training details may also be viewed after the training completes from this notebook."
   ]
  },

  {

"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
 "source": [
    "%%bigquery\n",
    "SELECT * FROM ML.TRAINING_INFO(MODEL demo.babyweight_model_fc);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "1jlHzJqfD5bU"
   },
   "source": [
    "## Make a prediction with the new model\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "EGo8fj9qD5bU"
   },
   "source": [
    "Perhaps it is of interest to make a prediction of the baby's weight given a number of other factors: Male, Mother is 28 years old, Mother will only have one child, and the baby was born after 38 weeks of pregnancy.\n",
    "\n",
    "To make this prediction, these values will be passed into the SELECT statement."
   ]
  },
  {

"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
    "%%bigquery\n",
    "SELECT\n",
    "  *\n",
    "FROM\n",
    "  ml.PREDICT(MODEL demo.babyweight_model_fc,\n",
    "      (SELECT\n",
    "          'True' AS is_male,\n",
    "          '28' AS mother_age,\n",
    "          '1' AS plurality,\n",
    "          '38' AS gestation_weeks\n",
    "    ))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "PK_-WNGUD5bX"
   },
   "source": [
    "<br>\n",
    "<br>\n",
    "<br>\n",
    "<br>\n",
    "Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "default_view": {},
   "name": "babyweight_bqml.ipynb",
   "provenance": [],
   "version": "0.3.2",
   "views": {}
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
